After introducing logistic regression yesterday, today we are of course going to implement it! As usual, the implementation is split into two parts: the first half in Python, and the second half a simple practice run in R. Without further ado, let's get started!
In this post we build a diabetes prediction model using a logistic regression classifier. First, download the Pima Indian Diabetes dataset from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database), then read it in with pandas.
# import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load the dataset; header=0 replaces the file's own header row with col_names, keeping the columns numeric
pima = pd.read_csv("/content/diabetes.csv", header=0, names=col_names)
pima.head()
The output shows the first five rows of the dataset.
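Before modeling, it doesn't hurt to confirm the size of the dataset and how (im)balanced the two classes are. A quick optional check, using the column names defined above:
# dataset dimensions and class balance of the target column
print(pima.shape)
print(pima.label.value_counts())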
Next, we split the columns into two groups: the dependent variable (the target) and the independent variables (the features).
# split the dataset into features and the target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Split into training and test sets. In scikit-learn, random_state plays roughly the same role as set.seed() (a random seed) in R.
# split X and y into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)
Build the model:
# import the logistic regression model
from sklearn.linear_model import LogisticRegression
# instantiate the model (default parameters, with a fixed random_state for reproducibility)
logreg = LogisticRegression(random_state=16)
# fit the model with data
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
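By default, predict() labels a patient as diabetic when the predicted probability exceeds 0.5. If you want to trade recall against precision (the R half below does this with a hand-picked cutoff), you can work with the probabilities directly; a small sketch with a purely illustrative cutoff of 0.4:
# predicted probability of the positive class (diabetes)
y_prob = logreg.predict_proba(X_test)[:, 1]
# apply a custom decision threshold; 0.4 here is just for illustration
y_pred_custom = (y_prob >= 0.4).astype(int)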
Build the confusion matrix:
# import the metrics
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
The output is:
array([[116,   9],
       [ 26,  41]])
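In scikit-learn's confusion matrix the rows are the actual classes and the columns the predicted classes, so the diagonal holds the correct predictions. As a quick check, you can unpack the four cells and recompute the accuracy by hand:
# unpack the 2x2 matrix: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cnf_matrix.ravel()
print(tn, fp, fn, tp, (tn + tp) / (tn + fp + fn + tp))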
Use a heat map to visualize the confusion matrix:
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
The output is the confusion matrix drawn as a heat map.
Let's look at the model's overall performance metrics (accuracy, precision, recall & F1-score):
from sklearn.metrics import classification_report
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_pred, target_names=target_names))
The output is the classification report.
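The R half below evaluates its model with AUC, so for a closer comparison you can also report ROC AUC on the Python side. A minimal sketch using the predicted probabilities (nothing here beyond what scikit-learn already provides):
from sklearn.metrics import roc_auc_score
# ROC AUC computed from the predicted probability of the positive class
y_prob = logreg.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob))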
That wraps up the Python half; now for a simple R version. First, read in the dataset:
data = read.csv("/Users/biaoyun/Documents/Ithome/diabetes.csv")
head(data)
The output shows the first six rows of the dataset.
Split into train & test sets:
library(caret)
set.seed(16)
trainIndex <- createDataPartition(data$Outcome, p=0.8, list=FALSE)
train_set <- data[trainIndex,]
test_set <- data[-trainIndex,]
Fit a cross-validated, L1-penalized logistic regression with glmnet, keeping the label column out of the predictor matrix:
library(glmnet)
NFOLDS = 10 # k-fold cross-validation
# build predictor matrices without the Outcome label
x_train = as.matrix(train_set[, setdiff(names(train_set), "Outcome")])
x_test = as.matrix(test_set[, setdiff(names(test_set), "Outcome")])
glmnet_classifier = cv.glmnet(x_train, y = train_set$Outcome,
                              family = 'binomial',
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)
preds = predict(glmnet_classifier, x_test, type = 'response')[,1]
glmnet:::auc(test_set$Outcome, preds) # AUC (not accuracy) on the test set
The output is the AUC on the held-out test set.
Confusion matrix & Performance metrics
# convert predicted probabilities into class labels using a 0.37 cutoff
assigner <- function(prediction){
  pred_class = c()
  for (i in seq_along(prediction)){
    if (prediction[i] > 0.37){
      pred_class[i] <- 1
    } else {
      pred_class[i] <- 0
    }
  }
  return(pred_class)
}
confusionMatrix(as.factor(assigner(preds)), as.factor(test_set$Outcome))
The output is the confusion matrix together with accuracy, sensitivity, specificity, and the other metrics that caret reports.
Thanks for reading today XD See you tomorrow!